Nonparametric feature selection for big data with MARS

Jaron Arbet

12/15/23

Motivation

Goal: build a flexible yet interpretable model like:

\[\begin{equation} \boldsymbol{y} = f(\boldsymbol{x}_1, ..., \boldsymbol{x}_P) + \boldsymbol{e} \end{equation}\]

Generalized linear models (GLMs), e.g. linear or logistic regression, are interpretable and can be flexible, but you need to decide:

  1. Which predictors to include?
  2. For each predictor: linear or non-linear effect?
  3. Do predictors interact? If yes, 2-way, 3-way, …?

MARS (multivariate adaptive regression splines) automatically determines all of this for you 😃

Piecewise linear functions

MARS uses simple piecewise linear functions (“splines”) that can approximate complex relationships

  • Knots: points in \(\boldsymbol{X}\) where effect on \(\boldsymbol{Y}\) (slope) changes

Question: what are 3 things you need to determine when fitting a piecewise linear spline?

  1. Number of knots
  2. Location of knots
  3. How the slopes change at each knot (i.e. estimate separate slopes within each interval)
  • MARS automatically decides all 3!
  • Everingham, Y. L., J. Sexton, and Jimmy White. “An introduction to multivariate adaptive regression splines for the cane industry.” Proceedings of the 2011 Conference of the Australian Society of Sugar Cane Technologists. 2011.
  • https://jekel.me/2017/Fit-a-piecewise-linear-function-to-data/
  • https://flexbooks.ck12.org/cbook/ck-12-interactive-algebra-1-for-ccss/section/1.4/primary/lesson/piecewise-linear-functions-alg-1-ccss/

Hinge function

  • Main building block of MARS used to construct the piecewise linear functions
  • For predictor \(x\) and knot at \(x=t\), the hinge fn has 2 parts (Hastie et al. 2009):

\[\begin{equation} (x-t)_+ = \max(0,\ x-t), \qquad (t-x)_+ = \max(0,\ t-x) \end{equation}\]

  • The hinge fn comes in a reflected pair of terms: Left \((t-x)_+\) and Right \((x-t)_+\)
  • If MARS selects a given hinge fn, it is input to a GLM and 2 coefficients are estimated: \(\beta_{Left}(t-x)_+ + \beta_{Right}(x-t)_+\)

Suppose knot \(t = 0.5\). Here is what the hinge \(\beta_{L}(t-x)_+ + \beta_{R}(x-t)_+\) looks like for various coefficients:
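A minimal sketch in base R of one reflected hinge pair at \(t = 0.5\) (the coefficient values below are arbitrary choices for illustration):

# positive-part ("hinge") terms for knot t
t <- 0.5;
x <- seq(0, 1, by = 0.01);
hinge.left  <- pmax(0, t - x);   # (t - x)_+
hinge.right <- pmax(0, x - t);   # (x - t)_+

# arbitrary example coefficients
beta.L <- -1;
beta.R <- 2;
y <- beta.L * hinge.left + beta.R * hinge.right;

# piecewise linear curve with a slope change at the knot
plot(x, y, type = 'l');
abline(v = t, lty = 2);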

Why piecewise linear hinge functions?

  • Many fancier splines exist (e.g. cubic, smoothing, penalized, B-splines)
  • The piecewise linear hinges are generally much faster and easier to implement for big data
  • …the regression surface is built up parsimoniously, using nonzero components locally—only where they are needed. This is important, since one should “spend” parameters carefully in high dimensions, as they can run out quickly [“curse of dimensionality”]. The use of other basis functions such as polynomials, would produce a nonzero product everywhere, and would not work as well (Hastie et al. 2009)

MARS algorithm

Overview

  1. Forward: build a large # of hinge terms that overfits the data
  2. Backward: use backward variable selection to prune the model
  3. Estimate the final coefficients with lm/glm

http://www.milbo.org/doc/earth-notes.pdf

Nonparametric?

  • Parametric models require the user to pre-specify the form and number of parameters to be estimated.
  • Nonparametric models like MARS automatically determine how to parameterize the model (# and form of terms), making them more flexible.

Forward step

  • \(\boldsymbol{M}\) = set of terms in model. Start with just an intercept \(\{1\}\).
  • \(\boldsymbol{C}\) = set of candidate hinge functions to add to model. Contains hinge functions at each observed value for each predictor (\(N * P * 2\) total terms):

\[\begin{equation} \boldsymbol{C} = \big\{(X_j - t)_+, (t - X_j)_+\big\} \\ t \in \{x_{1j}, x_{2j}, ..., x_{Nj}\}; \ j = 1,2, ..., P \end{equation}\]

“At each stage we consider all products of a candidate hinge in \(\boldsymbol{C}\) with a hinge in the model \(\boldsymbol{M}\). The product that decreases the residual error the most is added into the current model.” (Hastie et al. 2009)

Thus at each step, it’s possible to add:

  • Interaction term
  • New variable
  • New knot to an existing variable in the model

Example of the first 3 steps:

  • Stop once maximum number of terms is reached
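Not the earth implementation, but a rough sketch of how the candidate set \(\boldsymbol{C}\) could be enumerated and evaluated (make.candidates and eval.candidate are hypothetical helper names):

# positive part (u)_+
h <- function(u) pmax(0, u);

# all N * P * 2 candidate hinges: one (variable, knot, direction) triple per
# observed value of each predictor; sign = +1 gives (x_j - t)_+,
# sign = -1 gives (t - x_j)_+
make.candidates <- function(X) {
    grid <- expand.grid(i = seq_len(nrow(X)), j = seq_len(ncol(X)), sign = c(1, -1));
    grid$knot <- X[cbind(grid$i, grid$j)];
    return(grid);
    }

# evaluate one candidate hinge for all N observations
eval.candidate <- function(X, cand) {
    h(cand$sign * (X[, cand$j] - cand$knot));
    }

The forward pass would then, at each step, consider the product of each candidate with each term already in \(\boldsymbol{M}\) and keep whichever product most reduces the residual error.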

Backward step

  • Forward step purposely builds a large model that overfits
  • Backward step prunes the model to reduce overfitting:

The term whose removal causes the smallest increase in residual squared error is deleted from the model at each stage, producing an estimated best model \(f_\lambda\) of each size (number of terms) \(λ\) (Hastie et al. 2009)

Thus the best model of each size (1, 2, …, nprune terms) is identified

Best? Measured by the fast generalized cross-validation (GCV) criterion or by the more accurate but slower K-fold CV

GCV provides a convenient approximation to leave-one-out cross-validation for linear models [without needing to split/resample/refit the data] (Hastie et al. 2009)
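For reference, the GCV criterion takes the form (Hastie et al. 2009):

\[\begin{equation} GCV(\lambda) = \frac{\sum_{i=1}^{N} \big(y_i - \hat{f}_\lambda(x_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2} \end{equation}\]

where \(M(\lambda)\) is the effective number of parameters, which charges the model for both its coefficients and the knots selected during the forward step.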

Tuning MARS

Potential tuning parameters:

  1. Max degree of interactions allowed (set to 1 for none).
  2. Number of retained terms in the final model (nprune)
  3. Max number of terms in the Forward step (nk)

Simplest tuning strategy:

  • Set degree to a moderate value like 5 and use default nk
    • GCV is used to automatically select nprune
    • If “Reached max number of terms” then increase nk
    • If many 5-way interactions, then increase degree

Medium tuning strategy:

  • Same as above but use K-fold CV to select nprune

Advanced tuning strategy:

  • Grid for degree, nk and use K-fold CV to optimize.
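One way to carry out such a grid search is via caret, which tunes degree and nprune directly (nk is left at its earth default here). This is only a sketch, and dat / y are placeholder names for your data.frame and outcome:

library(caret);

set.seed(123);
tune.grid <- expand.grid(
    degree = 1:3,                  # max interaction degree
    nprune = seq(2, 20, by = 2)    # number of retained terms
    );
fit.tuned <- train(
    y ~ .,
    data = dat,                    # placeholder data.frame with outcome 'y'
    method = 'earth',
    tuneGrid = tune.grid,
    trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
    );
print(fit.tuned$bestTune);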

Example

  • Outcome = forced expiratory volume in liters (fev)
  • N = 654 youths aged 3-19 from East Boston during the 1970s
GGally::ggpairs(fev, showStrips = TRUE)

Fitting MARS with earth R package

  • By default, uses GCV to select optimal number of terms
library(earth);
fit.gcv <- earth(
    formula = fev ~.,
    data = fev,
    degree = 5,
    keepxy = TRUE
    );
print(fit.gcv);
Selected 6 of 17 terms, and 4 of 4 predictors
Termination condition: Reached nk 21
Importance: height.inches, sexMale, age, smokeYes
Number of terms at each degree of interaction: 1 3 2
GCV 0.1487769    RSS 93.32458    GRSq 0.8024061    RSq 0.8098985
  • Selected all 4 predictors, using 6 total hinge functions
  • GRSq normalizes GCV from 0 to 1, similar to adjusted \(R^2\) (see the definition below).
plot(fit.gcv, which = c(1));
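For reference, the earth documentation defines GRSq relative to the GCV of an intercept-only (null) model:

\[\begin{equation} GRSq = 1 - \frac{GCV}{GCV_{null}} \end{equation}\]

so GRSq approaches 1 as the model's GCV shrinks, and it can even go negative if the model generalizes worse than the intercept-only model.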

Using 10-fold CV (5 repeats) to select the model size

set.seed(123);
fit.cv <- earth(
    formula = fev ~.,
    data = fev,
    degree = 5,
    keepxy = TRUE,
    pmethod = 'cv',
    nfold = 10,
    ncross = 5
    );
print(fit.cv);
Selected 5 of 17 terms, and 3 of 4 predictors (pmethod="cv")
Termination condition: Reached nk 21
Importance: height.inches, sexMale, age, smokeYes-unused
Number of terms at each degree of interaction: 1 3 1
GRSq 0.8013863  RSq 0.8074228  mean.oof.RSq 0.7912086 (sd 0.049)

pmethod="backward" would have selected:
    6 terms 4 preds,  GRSq 0.8024061  RSq 0.8098985  mean.oof.RSq 0.7877625
plot(fit.cv, which = 1);

  • Very similar GRSq/RSq to the simpler GCV tuning method, but with 1 fewer term, and smoking status is omitted

Predictor effects

Let’s explore the predictor effects from fit.gcv:

Selected hinge functions and \(\hat{\beta}\):

                                    fev
(Intercept)                  2.74759130
h(65-height.inches)         -0.09188745
h(age-8)                     0.08327821
h(height.inches-65)*sexMale  0.24942931
h(height.inches-68)         -0.14391813
h(age-8)*smokeYes           -0.02834601

Variable importance scores:

evimp(fit.gcv);
              nsubsets   gcv    rss
height.inches        5 100.0  100.0
sexMale              4  46.1   46.5
age                  3  16.5   17.9
smokeYes             1   3.6    5.5

Partial dependence plots:

  • plotmo() can be used to plot the estimated effects
  • I prefer the pdp R package for making similar plots:
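A minimal sketch of both options (height.inches is chosen just for illustration; pdp::partial is assumed to work with the earth fit when the training data are supplied):

# plotmo (attached with earth): plot each predictor's estimated effect
plotmo(fit.gcv);

# pdp: partial dependence of fev on height.inches
library(pdp);
pd <- partial(
    fit.gcv,
    pred.var = 'height.inches',
    train = fev,
    type = 'regression'
    );
plotPartial(pd);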

Extensions

  • MARS/earth can be used for continuous, count, binary or multinomial outcomes
  • In theory, MARS can handle missing values and time-to-event outcomes. However, I’m not aware of any R implementations that support this.
  • caret’s bagged MARS can improve prediction performance at the expense of interpretability (see the sketch below)
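A hedged sketch of bagged MARS via caret’s 'bagEarth' method (the tuning grid and CV settings below are arbitrary choices):

library(caret);

set.seed(123);
fit.bag <- train(
    fev ~ .,
    data = fev,
    method = 'bagEarth',
    tuneGrid = expand.grid(degree = 1:2, nprune = seq(2, 20, by = 2)),
    trControl = trainControl(method = 'cv', number = 10)
    );
print(fit.bag);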

Summary

MARS automatically handles:

  • Feature engineering: what type of features to include? linear or non-linear, additive or interaction?
  • Feature selection: given a high-dimensional set of initial features, which should you include?

Other benefits:

  • Fast
  • Easy to tune: simplest approach has 0 tuning parameters
  • Interpretable
  • Handles mixed numeric/categorical predictors without needing further transformation (similar to tree-methods)
  • Robust to outliers in the predictors

Downsides?

  1. In my experience, although MARS is more interpretable, it generally has lower predictive performance compared to Random Forests. Bagging and/or random variable sets (like RF) might improve this, but needs further investigation.
  2. No statistical inference (p-values or confidence intervals). I have ideas for how to do this, talk with me if interested.
  3. Although MARS in theory can handle missing values or time-to-event outcomes, I’m not aware of any free software that supports this. In contrast, many packages for tree-based models support missing values and time-to-event outcomes (e.g. the randomForestSRC R package).

Questions?

References

  • “Notes on the earth package”
  • Original MARS paper: Friedman, Jerome H. “Multivariate adaptive regression splines.” The Annals of Statistics 19.1 (1991): 1-67.
    • A slightly more accessible intro: Friedman, Jerome H., and Charles B. Roosen. “An introduction to multivariate adaptive regression splines.” Statistical Methods in Medical Research 4.3 (1995): 197-217.
  • IMO, the Elements of Statistical Learning chapter on MARS is the best short introduction
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.